Abstract

We study posterior rates of contraction in Gaussian process regression with unbounded 
covariate domain. Our argument relies on developing a Gaussian approximation to the 
posterior of the leading coefficients of a Karhunen-Loeve expansion of the Gaussian process. 
The salient feature of our result is deriving such an approximation in the Wasserstein 
distance and relating the speed of the approximation to the posterior contraction rate 
using a coupling argument. Specific illustrations are provided for the Gaussian or squared- 
exponential covariance kernel. 

Keywords: kernel regression; Gaussian process; Hermite polynomials; posterior contraction;
random design; Wasserstein distance

1 Introduction 

Gaussian process (GP) priors [23] are popularly used in a variety of machine learning
applications including regression, classification, density estimation, latent variable modeling,
and unsupervised learning, to name a few. GP priors also share a deep connection with
frequentist reproducing kernel Hilbert space (RKHS) based regularization methods; see, for
example, Chapter 6 of [23]. Paralleling the development of scalable algorithms for GP regression,
there has been substantial progress in recent years in understanding frequentist properties
of the posterior arising from a Gaussian process prior. A standard way of evaluating
frequentist properties of Bayesian procedures is to consider whether the amount of posterior
mass assigned to a neighborhood of the true data-generating parameter (a function in the 
present setting) converges to one with increasing sample size. If the neighborhood size is 
fixed, the above phenomenon is termed posterior consistency, while if the neighborhood 
size is allowed to shrink to zero, then the (best possible) shrinking rate is termed the 
posterior contraction rate. [10, 16] established posterior consistency of GP priors, while 
posterior contraction rates in a variety of contexts were derived in [5, 22, 28, 29, 31, 32] 
among others; see also [24] for an information-theoretic approach. In particular, it has 
been established in various contexts that the posterior distribution contracts at an optimal 
rate (up to a logarithmic term) in a frequentist minimax sense. 

The above references exclusively deal with compactly supported functions as parameters,
even though the priors in principle are random functions on full Euclidean spaces. In
fact, the influential article [31] remarks that 

“Consistency of a posterior on the full space can be expected only if the tails of the
functions are restricted. If they are not, then one would still expect that the posterior 
restricted to compact subsets contracts at some rate. At the moment there seem to exist 
no results that would yield such a rate (or even consistency)”. 

In this article, we take a step towards addressing this question borrowing inspiration from 
the kernel regression literature [13, 25], where a norm weighted by a possibly unbounded 
covariate density is commonly used as a measure of discrepancy. We focus on the
nonparametric regression model with Gaussian errors

$$Y_i = f(X_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n, \tag{1}$$

where $X_i \in \mathcal{X}$ are covariates and $f : \mathcal{X} \to \mathbb{R}$ is an unknown regression function with
possibly unbounded domain $\mathcal{X} \subset \mathbb{R}^d$, which is assigned a zero-mean GP prior. We operate
in a random design setting where the covariates are drawn according to a distribution
$\rho$ on $\mathcal{X}$ and study contraction of the posterior in an $L^2(\rho)$ norm, i.e., the $L^2$ norm on
$\mathcal{X}$ weighted with respect to the covariate density $\rho$. This choice ensures that large
covariate values are weighted down, which can be considered as a way of restricting the
tails of the function as in the comment by [31] above.

In deriving the posterior rate of contraction, we expand the GP prior via a
Karhunen-Loève expansion [2] and then derive a Gaussian approximation to the posterior distribution
of the leading coefficients of the expansion. The Gaussian approximation is derived in an
$L^2$ Wasserstein metric, which is particularly suited for the present situation for reasons
described in the sequel. Using a careful coupling argument, the speed of such Gaussian
approximations is related to the posterior contraction rate, a result which is new to the best
of our knowledge. Another key ingredient of our method is to control the effect of
truncating the Karhunen-Loève expansion in the posterior. This typically requires bounds
on the concentration of the prior around the true function in the sup-norm, which is difficult
to control for unbounded covariates. A second contribution of this paper is to develop a
general result (Theorem 3.4) to bound (with high probability) the integrated log-likelihood
ratio from below by a quantity involving prior concentration around the true function in
the $L^2_\rho$ norm instead of the sup-norm. We believe this result may be of independent interest
in random design Gaussian regression. We may comment here that in addition to dealing
with unbounded covariates, the proposed technique has the added advantage of making the
bias-variance tradeoff in the posterior explicit, as in kernel ridge regression theory.

While we make general assumptions on the covariance kernel to prove our results, 
verifying them in a specific context requires suitable control over the eigenfunctions of 
the kernel. This can potentially be a non-trivial exercise, in particular if the covariance 
kernel involves a parameter which is sample-size dependent. We illustrate this in case 
of a squared-exponential kernel, for which explicit expressions of the eigenfunctions are 
available [23]. We develop precise bounds on the eigenfunctions making the role of a scale 
parameter explicit, which should be more broadly useful. 

2 Preliminaries 

For a square matrix $B$, $\mathrm{tr}(B)$ and $|B|$ respectively denote the trace and the determinant
of $B$. If $B$ is positive semi-definite (psd), then let $B^{1/2}$ denote its unique psd square-root,
so that $(B^{1/2})^2 = B$. $B$ is positive definite (pd) if and only if $B^{1/2}$ is pd [4], and in such
cases we can unambiguously define $B^{-1/2} = (B^{1/2})^{-1}$. Given two pd matrices $B_1$ and
$B_2$, we write $B_1 \succeq B_2$ if $B_1 - B_2$ is psd. For a $p \times d$ matrix $A = (a_{jl})$ with $p \ge d$, the
singular values of $A$ are the eigenvalues of $(A^T A)^{1/2}$. We shall use $s_{\max}(A)$ and $s_{\min}(A)$ to
denote the largest and smallest non-zero singular values respectively; the condition number is
$\kappa(A) = s_{\max}(A)/s_{\min}(A)$. The Frobenius norm ($\|\cdot\|_F$) and the operator norm ($\|\cdot\|_2$) are
defined in the usual way, with $\|A\|_F := \sqrt{\mathrm{tr}(A^T A)}$ and $\|A\|_2 := s_{\max}(A)$. Note that

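For a vector $x \in \mathbb{R}^d$, $\|x\|$ will denote its Euclidean norm. Let
$\ell_2 = \{\theta = (\theta_1, \theta_2, \ldots) : \sum_{j=1}^{\infty} \theta_j^2 < \infty\}$ denote the space of square-summable
sequences, with $\|\theta\|_{\ell_2}^2 = \sum_{j=1}^{\infty} \theta_j^2$.
Let $\Theta_\alpha = \{\theta \in \ell_2 : \sum_{j=1}^{\infty} j^{2\alpha} \theta_j^2 < \infty\}$ denote the Sobolev space of sequences with
“smoothness” $\alpha > 0$, and denote the Sobolev norm $\|\theta\|_\alpha = (\sum_{j=1}^{\infty} j^{2\alpha} \theta_j^2)^{1/2}$.
For a density $\rho$ on $\mathcal{X} \subset \mathbb{R}^d$, let $L^2_\rho(\mathcal{X}) = \{g : \int g(x)^2 \rho(x)\,dx < \infty\}$ denote the space of
square-integrable functions with respect to $\rho$. $L^2_\rho(\mathcal{X})$ is a Hilbert space under the inner product
$\langle g_1, g_2 \rangle = \int g_1(x) g_2(x) \rho(x)\,dx$; the resulting norm will be denoted by $\|\cdot\|_{2,\rho}$, so that
$\|g\|_{2,\rho}^2 = \int g(x)^2 \rho(x)\,dx$.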

Throughout, $C, C', C_1, C_2, \ldots$ are generically used to denote positive constants whose
values might change from one line to another, but are independent of everything else.
$\lesssim$ and $\gtrsim$ denote inequalities up to a constant multiple; we write $a \asymp b$ when we have
both $a \lesssim b$ and $a \gtrsim b$.

2.1 The $L^p$ Wasserstein distances

Given two probability measures $P$ and $Q$ on $\mathbb{R}^d$, the total variation distance
$d_{TV}(P, Q) := \sup_B |P(B) - Q(B)|$, where the supremum is over all Borel subsets $B$ of $\mathbb{R}^d$, and the
Kullback-Leibler divergence $D(P\|Q)$ are defined in the usual way. For $p \ge 1$, the $L^p$ Wasserstein
distance with respect to the Euclidean metric (henceforth $W_p$ in short), denoted $d_{W,p}(P, Q)$,
is defined as

$$d_{W,p}(P, Q) = \inf_{\mathrm{joint}(P,Q)} \left( E\|X - Y\|^p \right)^{1/p}, \tag{2}$$

where $\mathrm{joint}(P, Q)$ denotes all random vectors $(X, Y) \in \mathbb{R}^d \times \mathbb{R}^d$ such that $X \sim P$, $Y \sim Q$.
The Wasserstein distances have their origins in the problem of optimal transport;
refer to [14, 17] for background and properties. Explicit expressions are available for
the $W_2$ distance between two $d$-dimensional Gaussian measures. In particular, if
$P = N_d(\mu_1, \Sigma_1)$, $Q = N_d(\mu_2, \Sigma_2)$ and $\Sigma_1 \Sigma_2 = \Sigma_2 \Sigma_1$, then

$$d^2_{W,2}(P, Q) = \|\mu_1 - \mu_2\|^2 + \|\Sigma_1^{1/2} - \Sigma_2^{1/2}\|_F^2. \tag{3}$$

For $d = 1$, the $W_2$ distance is identical to the Fréchet distance [12].
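
As an illustration of formula (3), the following is a minimal sketch, not part of the original development, that evaluates the squared $W_2$ distance between two Gaussians with commuting covariance matrices. The function names `psd_sqrt` and `w2_squared_commuting` are hypothetical helpers introduced here, and NumPy is assumed to be available.

```python
import numpy as np

def psd_sqrt(S):
    """Unique psd square root of a symmetric psd matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)  # guard against tiny negative round-off
    return (V * np.sqrt(w)) @ V.T

def w2_squared_commuting(mu1, S1, mu2, S2, tol=1e-10):
    """Squared W_2 distance between N(mu1, S1) and N(mu2, S2) using formula (3).

    Formula (3) assumes S1 @ S2 == S2 @ S1; this is checked and an error is
    raised otherwise.
    """
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    if not np.allclose(S1 @ S2, S2 @ S1, atol=tol):
        raise ValueError("covariances do not commute; formula (3) does not apply")
    mean_part = np.sum((mu1 - mu2) ** 2)
    cov_part = np.linalg.norm(psd_sqrt(S1) - psd_sqrt(S2), "fro") ** 2
    return mean_part + cov_part

# One-dimensional check: N(0, 1) versus N(3, 4) gives 3^2 + (1 - 2)^2.
d1 = w2_squared_commuting([0.0], [[1.0]], [3.0], [[4.0]])
# Two-dimensional check: N(0, I) versus N(0, 4 I) gives ||I - 2 I||_F^2.
d2 = w2_squared_commuting([0.0, 0.0], np.eye(2), [0.0, 0.0], 4.0 * np.eye(2))
```

The commutativity check matters because (3) is stated only for $\Sigma_1 \Sigma_2 = \Sigma_2 \Sigma_1$; this is exactly the situation used later, where one covariance is a multiple of the identity.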

3 Posterior contraction in random design GP regression 

Write the nonparametric regression model (1) in vector form as

$$Y = F + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I_n), \tag{4}$$

where $Y = (Y_1, \ldots, Y_n)^T$ and $F = (f(X_1), \ldots, f(X_n))^T$. We shall assume the error variance
$\sigma^2$ to be known throughout this paper. Let $f_0 : \mathcal{X} \to \mathbb{R}$ denote the true data generating
function and define $F_0 = (f_0(X_1), \ldots, f_0(X_n))^T$.

As mentioned in the Introduction, we operate in a random design setting where we
assume that the covariates $X_i$ are independent and identically distributed according to a
known density $\rho$ on $\mathcal{X}$ and $Y_i \mid X_i \sim N(f_0(X_i), \sigma^2)$ independently for $i = 1, \ldots, n$. Letting
$h(y, x) = N(y \mid f_0(x), \sigma^2)\rho(x)$, the true joint density of $(Y, X)$ is an $n$-fold product of $h$.
We shall use $E_0$ to denote an expectation with respect to the true joint distribution of
$(Y, X)$; $E_X$ and $E_{0|X}$ will respectively denote an expectation with respect to the marginal
distribution of $X$ and the conditional of $Y$ given $X$. Similarly, $P_X$ and $P_{0|X}$ will denote
probabilities under the respective distributions.

Consider a $GP(0, \sigma^2 K)$ prior on $f$, where $K(\cdot, \cdot)$ is a positive definite correlation
function, i.e., $K(x, x) = 1$ for all $x \in \mathcal{X}$. We shall generically use $\Pi$ and $\Pi(\cdot \mid Y, X)$ to
denote the prior and posterior distribution of $f$. Under suitable regularity conditions,
Mercer’s theorem [2] guarantees that the kernel $K$ admits an eigen-expansion of the form
$K(x, x') = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(x')$, where $\{\phi_j\}$ is an orthonormal system in $L^2_\rho(\mathcal{X})$
($\int \phi_j(x) \phi_l(x) \rho(x)\,dx = \delta_{jl}$) and $\{\lambda_j\}$ are the corresponding non-negative eigenvalues, which
satisfy

$$\int K(x, x') \phi_j(x') \rho(x')\,dx' = \lambda_j \phi_j(x), \quad j = 1, 2, \ldots \tag{5}$$

As a concrete example, consider the squared-exponential kernel $K_a(x, x') = \exp(-a^2\|x - x'\|^2)$
indexed by an inverse length-scale parameter $a$. For Gaussian covariate distributions $\rho$, explicit
expressions for the eigenfunctions and eigenvalues are known [23]. Specifically, when the
dimension $d = 1$, with a Gaussian covariate density $\rho(x) = \sqrt{2b/\pi}\, e^{-2bx^2}$ and
$c = \sqrt{b^2 + 2a^2 b}$, $A = a^2 + b + c$, $B = a^2/A$,

$$\phi_j(x) = \frac{(c/b)^{1/4}}{\sqrt{2^{j-1}(j-1)!}} \exp\{-(c - b)x^2\}\, H_{j-1}(\sqrt{2c}\,x), \quad \lambda_j = \left(\frac{2b}{A}\right)^{1/2} B^{j-1}, \tag{6}$$

where $H_k(x) = (-1)^k e^{x^2} \frac{d^k}{dx^k} e^{-x^2}$, $k = 0, 1, \ldots$ denote the Hermite polynomials$^1$. We shall
return to the squared-exponential kernel in Section 4.

By the Karhunen-Loève Theorem [2], the GP itself can be expanded as

$$f(x) = \sum_{j=1}^{\infty} \sqrt{\lambda_j}\, Z_j \phi_j(x), \tag{7}$$

where the $Z_j$s are i.i.d. $N(0, 1)$. If the series representation above is truncated to the first $k$
terms and the resulting random function is denoted by $f^k$, then it follows from (5) and the
orthogonality of the eigenfunctions $\phi_j$ that $E\|f - f^k\|_{2,\rho}^2 = \sum_{j=k+1}^{\infty} \lambda_j$. The accuracy
of the truncation relies on the rate of decay of the eigenvalues, which is related to the
smoothness of the GP. For example, if the sample paths of a GP are infinitely smooth,
then the eigenvalues decay exponentially fast, so that relatively few leading terms in the
expansion (7) offer a close reconstruction of the original process.
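
To make the dependence of the truncation error on the eigenvalue decay concrete, here is a small sketch for the one-dimensional squared-exponential kernel with Gaussian covariate density. It assumes the closed-form eigenvalues from [23] in the parametrization $c = \sqrt{b^2 + 2a^2 b}$, $A = a^2 + b + c$, $B = a^2/A$, $\lambda_j = \sqrt{2b/A}\, B^{j-1}$; the function names are hypothetical and the formula should be checked against [23] before relying on it.

```python
import math

def se_eigenvalues(a, b, k):
    """First k eigenvalues lambda_1..lambda_k of exp(-a^2 (x - x')^2) with respect to
    rho(x) = sqrt(2b/pi) exp(-2 b x^2), using the assumed closed form from [23]."""
    c = math.sqrt(b * b + 2.0 * a * a * b)
    A = a * a + b + c
    B = a * a / A
    return [math.sqrt(2.0 * b / A) * B ** j for j in range(k)]

def truncation_error(a, b, k):
    """Tail sum of lambda_j over j > k. The eigenvalues are geometric, so the tail
    has the closed form sqrt(2b/A) * B^k / (1 - B)."""
    c = math.sqrt(b * b + 2.0 * a * a * b)
    A = a * a + b + c
    B = a * a / A
    return math.sqrt(2.0 * b / A) * B ** k / (1.0 - B)

# b = 1/4 corresponds to a standard normal covariate density.
total_mass = truncation_error(1.0, 0.25, 0)      # sum of all eigenvalues
err_small_a = truncation_error(1.0, 0.25, 10)    # a = 1: fast decay
err_large_a = truncation_error(5.0, 0.25, 10)    # a = 5: slower decay
```

Under these formulas the eigenvalues sum to one, matching $K(x, x) = 1$, and a larger $a$ leaves more mass in the tail for the same truncation level.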

Given a $GP(0, \sigma^2 K)$ prior, we shall consider such truncations of (7) to define priors
which we refer to as truncated Gaussian process (tGP) priors:

$$f(x) = \sum_{j=1}^{k_n} \theta_j \phi_j(x), \quad \theta_j \sim N(0, \sigma^2 \lambda_j). \tag{8}$$

Let $\theta^k = (\theta_1, \ldots, \theta_{k_n})^T$ denote the $k_n$-dimensional vector of coefficients in (8) and
$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{k_n})$, so that $\theta^k \sim N(0, \sigma^2 \Lambda)$. One may consider the tGP priors (8) as sieve
approximations to the original GP prior, where the basis functions $\phi_j$ and the prior variances
$\lambda_j$ are determined by the choice of the kernel $K$. We denote such priors by $tGP_{k_n}(0, \sigma^2 K)$;
the truncation level $k_n$ will be suppressed when clear from the context.

$^1$Many references term the $H_k$s the “physicist’s Hermite polynomials” to distinguish them from the
“probabilist’s Hermite polynomials” $h_k(x) = 2^{-k/2} H_k(x/\sqrt{2})$.

We note here that the tGP prior is solely introduced to obtain theoretical understanding
of the original GP prior and the resulting posterior. When working with a tGP prior, one
can conveniently direct attention to the coefficient vector $\theta^k$, which is finite-dimensional,
albeit with the dimension possibly increasing with sample size $n$. In fact, defining the
$n \times k_n$ (random) matrix $\Phi = (\phi_j(X_i))_{1 \le i \le n,\, 1 \le j \le k_n}$, one can write model (4) equipped with
a tGP prior (8) as

$$Y \sim N(F, \sigma^2 I_n), \quad F = \Phi \theta^k, \quad \theta^k \sim N_{k_n}(0, \sigma^2 \Lambda). \tag{9}$$

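Using standard Gaussian conjugacy, the posterior distribution of $\theta^k$ under (9) is

$$\mathcal{W}^k(\cdot \mid X, Y) = N(\hat\theta, \hat\Sigma), \quad \hat\theta = (\Phi^T \Phi + \Lambda^{-1})^{-1} \Phi^T Y, \quad \hat\Sigma = \sigma^2 (\Phi^T \Phi + \Lambda^{-1})^{-1}. \tag{10}$$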
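
The conjugate update for a Gaussian likelihood $Y \sim N(\Phi\theta, \sigma^2 I_n)$ with prior $\theta \sim N(0, \sigma^2\Lambda)$ can be sketched in a few lines. This is an illustrative implementation under the assumption that the prior eigenvalues are strictly positive; `tgp_posterior` is a hypothetical name, not something defined in the paper.

```python
import numpy as np

def tgp_posterior(Phi, Y, lam, sigma2):
    """Posterior mean and covariance of the coefficient vector under the
    Gaussian linear model with prior N(0, sigma2 * diag(lam)).

    Phi    : (n, k) design matrix with entries phi_j(X_i)
    Y      : (n,) response vector
    lam    : (k,) prior eigenvalues, assumed strictly positive
    sigma2 : known error variance
    """
    Phi = np.asarray(Phi, float)
    Y = np.asarray(Y, float)
    precision = Phi.T @ Phi + np.diag(1.0 / np.asarray(lam, float))
    theta_hat = np.linalg.solve(precision, Phi.T @ Y)   # ridge-type posterior mean
    Sigma_hat = sigma2 * np.linalg.inv(precision)        # posterior covariance
    return theta_hat, Sigma_hat

# Tiny example: n = 2, k = 1, so the precision is 2 + 1 = 3.
theta_hat, Sigma_hat = tgp_posterior([[1.0], [1.0]], [1.0, 3.0], [1.0], sigma2=2.0)
```

Note that $\sigma^2$ cancels in the posterior mean, which therefore coincides with a generalized ridge regression estimate with penalty matrix $\Lambda^{-1}$.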

From (10), the posterior distribution of $F = \Phi \theta^k$ is also Gaussian. With a slight abuse
of terminology, we shall refer to the posterior distribution (10) of $\theta^k$ as the tGP posterior
induced by the tGP prior. The role of the tGP in deriving posterior rates of contraction
for the original GP prior is made precise through the following general rate theorem for
GP priors. We first state our assumption regarding the true data generating function $f_0$
and introduce some notation.


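(T1) The true data generating function $f_0 \in L^2_\rho(\mathcal{X})$, so that $f_0 = \sum_{j=1}^{\infty} \theta_{0j} \phi_j$ with
$\theta_{0j} = \langle f_0, \phi_j \rangle := \int f_0(x) \phi_j(x) \rho(x)\,dx$. The convergence of the infinite sum is in an $L^2_\rho$ sense,
i.e., $\|f_0 - \sum_{j=1}^{J} \theta_{0j} \phi_j\|_{2,\rho} \to 0$ as $J \to \infty$.

Define $\theta_0^k = (\theta_{0j})_{1 \le j \le k_n} \in \mathbb{R}^{k_n}$ and $f_0^k = \sum_{j=1}^{k_n} \theta_{0j} \phi_j$. Also define
$\|\theta_0^k\|_{\mathbb{H}}^2 = (\theta_0^k)^T \Lambda^{-1} \theta_0^k = \sum_{j=1}^{k_n} \theta_{0j}^2/\lambda_j$.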

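Theorem 3.1. Consider model (1) with a GP prior $f \sim GP(0, \sigma^2 K)$, where the kernel $K$
has eigenfunctions $\{\phi_j\}$ and eigenvalues $\{\lambda_j\}$ with respect to the covariate density $\rho$ as in
(5). Assume the true function $f_0$ satisfies (T1). For $k_n \le n$, let $\mathcal{W}^k(\cdot \mid X, Y)$ denote the
tGP posterior as in (10). Let $\epsilon_n \to 0$ be a sequence with $n\epsilon_n^2 \to \infty$ and $\|f_0 - f_0^k\|_{2,\rho} \lesssim \epsilon_n$.
Then, for any $M > 0$,

$$E_0 \Pi(\|f - f_0\|_{2,\rho} > M\epsilon_n \mid Y, X) \le T_{1n} + T_{2n}, \tag{11}$$

where

$$T_{1n} = \frac{E_0\left[ 1_{A_n}(X)\, d^2_{W,2}\left( \mathcal{W}^k(\cdot \mid Y, X),\, N_{k_n}(\theta_0^k, \sigma^2 I_{k_n}/n) \right) \right]}{M^2 \epsilon_n^2/4} + P(\chi^2_{k_n} > M^2 n \epsilon_n^2/4) + P_X(A_n^c), \tag{12}$$

$$T_{2n} = E_0 \Pi(\|f - f^k\|_{2,\rho} > M\epsilon_n \mid Y, X). \tag{13}$$

In (12), $\chi^2_r$ denotes a chi-squared random variable with $r$ degrees of freedom and $A_n \subset \mathcal{X}^n$ is any
set in the $\sigma$-field generated by $X_1, \ldots, X_n$.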
It immediately follows from (11) that for a given $\epsilon_n$, if the sequences $T_{1n}, T_{2n} \to 0$, then
$\epsilon_n$ is an upper bound to the posterior contraction rate [15] in the $\|\cdot\|_{2,\rho}$ norm; note that no
assumption regarding the support of the covariate density $\rho$ is made. Theorem 3.1 thus
relates the posterior contraction rate of a GP prior to (i) the speed of a posterior Wasserstein
approximation of the induced tGP prior ($T_{1n}$), and (ii) the associated truncation error
($T_{2n}$). To obtain the best possible rate out of Theorem 3.1, one needs to choose the
truncation level $k_n$ (and to a lesser extent the set $A_n$) in an optimal fashion. The role
of these quantities will become more explicit once we provide manageable bounds to $T_{1n}$
and $T_{2n}$ in the subsequent sections. To that end, we need to make additional assumptions
on the eigenfunctions $\{\phi_j\}$ and eigenvalues $\{\lambda_j\}$ of the kernel $K$ stated below. Recall
$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{k_n})$ and $\Phi = (\phi_j(X_i))_{1 \le i \le n,\, 1 \le j \le k_n}$. Assume

(A1) $\|\Lambda^{-1}\|_2 \le n/4$.

(A2) $\sup_{x \in \mathcal{X}} |\phi_j(x)| \le L_n$ for all $j = 1, \ldots, k_n$, with $L_n^2 k_n \log k_n \le n$.

Assumption (A1) typically implies a bound on the growth rate of $k_n$; for example, if the
eigenvalues decay polynomially, $\lambda_j \asymp j^{-\beta}$ for some $\beta > 0$, then $\|\Lambda^{-1}\|_2 = \lambda_{k_n}^{-1} \asymp k_n^{\beta}$
and hence (A1) is satisfied for all $k_n \lesssim n^{1/\beta}$. Assumption (A2) is readily satisfied
if all the eigenfunctions $\phi_j$ are uniformly bounded in magnitude by a constant. However,
(A2) is more general and allows the sup-norm of the top $k_n$ eigenfunctions to increase with
$n$ subject to a growth condition; note that no assumption is made regarding the trailing
eigenfunctions. Allowing the sup-norm bound to grow with $n$ is important when the kernel
is indexed by one or more hyperparameters which may depend on $n$. A specific illustration
is provided in the context of the squared-exponential covariance kernel (6) in Section 4. It
turns out to be a non-trivial exercise to bound the eigenfunctions (6) while making the dependence on
the bandwidth parameter $a$ explicit.

3.1 Wasserstein approximations to tGP posteriors 

To bound $T_{1n}$, one primarily needs a handle on the squared $W_2$ distance between the tGP
posterior $\mathcal{W}^k(\cdot \mid Y, X)$ in (10) and a Gaussian $N_{k_n}(\theta_0^k, \sigma^2 I_{k_n}/n)$ distribution. Inspecting
the proof of Theorem 3.1, it may seem that a more obvious choice for $T_{1n}$ is

$$E_0\, d_{TV}\left( \mathcal{W}^k(\cdot \mid Y, X),\, N_{k_n}(\theta_0^k, \sigma^2 I_{k_n}/n) \right) + P(\chi^2_{k_n} > M^2 n \epsilon_n^2/4),$$

where $d_{TV}$ denotes the total variation distance. However, such approximations in the
total variation distance require a prior flatness condition [7] which is not satisfied by the
tGP priors. We find that bounding the $W_2$ distance between the tGP posterior and the
asymptotic Gaussian distribution is less demanding in the present setting compared to the
total variation distance. However, the connection between such an approximation result in
the $W_2$ distance and posterior contraction rates in the $\|\cdot\|_{2,\rho}$ norm is not immediately clear.
We devise a coupling argument to relate the two quantities in the proof of Theorem 3.1.

The $1_{A_n}(X)$ term in $T_{1n}$ is introduced as a technical device to control the expectation
of the squared Wasserstein distance on $A_n$; we choose $A_n$ so that
$A_n^c$ receives vanishingly small probability under $P_X$. Before proceeding further, we settle
on a choice of $A_n$ in the following Lemma 3.2.

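Lemma 3.2. Assume the eigenfunctions $\{\phi_j\}$ of the kernel $K$ with respect to the covariate
density $\rho$ satisfy (A2). Define $A_n = \{\|\Phi^T \Phi - n I_{k_n}\|_2 \le n/2\}$. Then,
$P_X(A_n^c) \le k_n e^{-Cn/(k_n L_n^2)}$.

Remark 3.1. On the set $A_n$, $\Phi$ satisfies

$$\|\Phi^T \Phi\|_2 \le 3n/2, \quad s_{\min}(\Phi^T \Phi) \ge n/2, \quad \kappa(\Phi^T \Phi) \le 3, \quad \mathrm{tr}[(\Phi^T \Phi)^{-1}] \le \frac{2k_n}{n}. \tag{14}$$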

Lemma 3.2 follows from a measure concentration phenomenon which under appropriate
conditions on the summands ensures that a sum of independent symmetric random
matrices is concentrated around its expectation with high probability. We can write
$\Phi^T \Phi = \sum_{i=1}^{n} \phi^{(i)} (\phi^{(i)})^T$ with $\phi^{(i)} = (\phi_j(X_i))_{1 \le j \le k_n} \in \mathbb{R}^{k_n}$ independent for $i = 1, \ldots, n$. Using the
orthonormality of the eigenfunctions $\{\phi_j\}$, $E\,\phi_j(X_i)\phi_l(X_i) = \int \phi_j(x) \phi_l(x) \rho(x)\,dx = \delta_{jl}$
and hence $E_X \Phi^T \Phi = n I_{k_n}$. We specifically apply a version of the matrix Bernstein inequality
[27] to prove the concentration of $\Phi^T \Phi$ around $n I_{k_n}$; the proof is deferred to Section 5. The
sup-norm bound on the eigenfunctions $\phi_j$ in Lemma 3.2 is used to bound the operator
norms of the matrices $\phi^{(i)} (\phi^{(i)})^T$.
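
The concentration of $\Phi^T\Phi$ around $nI_{k_n}$ is easy to observe numerically. The sketch below uses, purely for illustration, a cosine basis that is orthonormal with respect to the uniform density on $[0, 1]$ and uniformly bounded by $\sqrt{2}$; this basis is an assumption made here as a simple stand-in and is not the eigenbasis studied in the paper. The helper names are hypothetical.

```python
import numpy as np

def cosine_design(x, k):
    """Design matrix for phi_1 = 1, phi_j = sqrt(2) cos((j - 1) pi x), j >= 2,
    an orthonormal system with respect to the uniform density on [0, 1]."""
    j = np.arange(k)
    Phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j))
    Phi[:, 0] = 1.0  # the constant function has unit norm without the sqrt(2)
    return Phi

def gram_deviation(Phi):
    """Operator norm of Phi^T Phi - n I, the quantity that defines the set A_n."""
    n, k = Phi.shape
    return np.linalg.norm(Phi.T @ Phi - n * np.eye(k), 2)

rng = np.random.default_rng(0)
n, k = 2000, 5
Phi = cosine_design(rng.uniform(size=n), k)
dev = gram_deviation(Phi)
in_A_n = bool(dev <= n / 2)  # membership in the event A_n of Lemma 3.2
```

With a bounded basis and $k_n$ small relative to $n$, the deviation is typically of order $\sqrt{n}$ up to factors in $k_n$, far below the $n/2$ threshold.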

We are now in a position to state our approximation result in the $W_2$ distance that
provides a simple bound to the first term of $T_{1n}$ in (12). Recall $\theta_0^k, f_0^k$ from (T1).

Theorem 3.3. Assume the true function $f_0$ satisfies (T1) and the eigenfunctions $\{\phi_j\}$
and eigenvalues $\{\lambda_j\}$ of the kernel $K$ with respect to the covariate density $\rho$ satisfy (A1)
and (A2). Let $A_n \subset \mathcal{X}^n$ be the set defined in Lemma 3.2. The tGP posterior $\mathcal{W}^k(\cdot \mid Y, X)$
from (10) satisfies

$$E_0\left[ 1_{A_n}(X)\, d^2_{W,2}\left( \mathcal{W}^k(\cdot \mid Y, X),\, N_{k_n}(\theta_0^k, \sigma^2 I_{k_n}/n) \right) \right] \lesssim \frac{k_n}{n} + \frac{\|\theta_0^k\|_{\mathbb{H}}^2}{n} + \|f_0 - f_0^k\|_{2,\rho}^2. \tag{15}$$


While $f_0$ is only assumed to be an element of $L^2_\rho(\mathcal{X})$ in Theorem 3.3, additional smoothness
assumptions can be utilized to obtain more precise bounds on the truncation error
$\|f_0 - f_0^k\|_{2,\rho}$ in (15). The bound (15) indicates a typical bias-variance type tradeoff:
increasing the truncation level $k_n$ will improve the truncation error $\|f_0 - f_0^k\|_{2,\rho}$, however
at the expense of the first two terms increasing. Typically, if $f_0$ is $\alpha$-smooth, then the
first two summands contribute a $k_n/n$ factor and the truncation error is of the order $k_n^{-2\alpha}$,
with $k_n/n + k_n^{-2\alpha}$ attaining its minimum when $k_n \asymp n^{1/(2\alpha+1)}$. The $\|\theta_0^k\|_{\mathbb{H}}$ term can be
considered an RKHS type penalty; indeed, it is the RKHS norm of $\theta_0^k$ relative to a $N(0, \Lambda)$
distribution [30].
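
The balancing of $k_n/n$ against $k_n^{-2\alpha}$ can be checked by a one-line calculus argument. The following is a heuristic sketch that treats $k$ as a continuous variable and ignores multiplicative constants:

```latex
\[
g(k) = \frac{k}{n} + k^{-2\alpha}, \qquad
g'(k) = \frac{1}{n} - 2\alpha\, k^{-2\alpha - 1} = 0
\;\Longleftrightarrow\;
k^{2\alpha + 1} = 2\alpha n
\;\Longleftrightarrow\;
k = (2\alpha n)^{1/(2\alpha + 1)} \asymp n^{1/(2\alpha + 1)} .
\]
\[
\text{At this choice, } \frac{k}{n} \asymp k^{-2\alpha} \asymp n^{-2\alpha/(2\alpha + 1)},
\quad\text{so the squared rate is } \epsilon_n^2 \asymp n^{-2\alpha/(2\alpha+1)},
\text{ i.e. } \epsilon_n \asymp n^{-\alpha/(2\alpha + 1)} .
\]
```

Since $g''(k) > 0$ for $k > 0$, the stationary point is indeed the minimizer.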


3.2 Handling the truncation error 

We now focus attention on the term T 2 n in (11). To this end, we rely on a standard 
argument in Bayesian nonparametrics: if the prior probability of a set is exponentially 
small, then its posterior probability converges to zero. Such an argument is commonly 
used to derive upper [15] and lower [9] bounds to the posterior convergence rate. However, 
a crucial ingredient for the above argument to work is to obtain suitable lower bounds to 
the log-likelihood ratio integrated with respect to the prior. The only such result that we
are aware of in the random design setting is from [32], who derive a bound for the empirical
$L_2$ norm and then use a functional Bernstein inequality to extrapolate to the $L^2_\rho$ norm.
Their result requires the prior draws from the GP to be bounded with probability one,
which may not be the case for non-compact covariates. In Theorem 3.4 below, we develop
a general result to bound (with high probability) the integrated log-likelihood ratio from
below by a quantity involving the prior concentration around the true function in the
$L^2_\rho$ norm. A proof of Theorem 3.4 can be found in Section 5.

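Theorem 3.4. Recall $F = (f(X_1), \ldots, f(X_n))^T$, $F_0 = (f_0(X_1), \ldots, f_0(X_n))^T$ and $X_1, \ldots, X_n$
are independently and identically distributed according to the density $\rho$. For $\mu \in \mathbb{R}^n$, let
$p_{n,\mu}(\cdot)$ denote the $N_n(\mu, I_n)$ density. Let $\Pi$ be a prior on $L^2_\rho(\mathcal{X})$ and $\tilde\epsilon_n \to 0$ be a sequence such
that $n\tilde\epsilon_n^2 \to \infty$. Then,

$$P_0\left( \int \frac{p_{n,F}(Y)}{p_{n,F_0}(Y)}\, d\Pi(f) \ge e^{-n\tilde\epsilon_n^2}\, \Pi(\|f - f_0\|_{2,\rho} \le \tilde\epsilon_n) \right) \ge 1 - \frac{C \log(n\tilde\epsilon_n^2)}{\sqrt{n\tilde\epsilon_n^2}}. \tag{16}$$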

Using Theorem 3.4 along with a standard argument (see, for example, Theorem 2.1 of
[15]), we can bound

$$T_{2n} \lesssim \frac{\Pi(\|f - f^k\|_{2,\rho} > M\epsilon_n)}{e^{-n\tilde\epsilon_n^2}\, \Pi(\|f - f_0\|_{2,\rho} \le \tilde\epsilon_n)} + \frac{\log(n\tilde\epsilon_n^2)}{\sqrt{n\tilde\epsilon_n^2}}. \tag{17}$$

Using Theorem 3.3 and Theorem 3.4, we arrive at the following corollary to Theorem 3.1.

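Corollary 3.5. Consider model (1) with a GP prior $f \sim GP(0, \sigma^2 K)$. Assume the true
function $f_0$ satisfies (T1). Let $\epsilon_n \to 0$ satisfy $n\epsilon_n^2 \to \infty$. Let $k_n \le n$ be such that

(C0) The eigenfunctions $\{\phi_j\}$ and eigenvalues $\{\lambda_j\}$ of the kernel $K$ with respect to
the covariate density $\rho$ satisfy (A1) and (A2).

(C1) $\max\{k_n, \|\theta_0^k\|_{\mathbb{H}}^2\} = o(n\epsilon_n^2)$.

(C2) $\|f_0 - f_0^k\|_{2,\rho} = o(\epsilon_n)$.

(C3) There exists a sequence $\tilde\epsilon_n \to 0$ with $n\tilde\epsilon_n^2 \to \infty$ such that

$$\frac{\Pi(\|f - f^k\|_{2,\rho} > M\epsilon_n)}{e^{-n\tilde\epsilon_n^2}\, \Pi(\|f - f_0\|_{2,\rho} \le \tilde\epsilon_n)} \to 0. \tag{18}$$

Then, for a large constant $M > 0$,

$$\lim_{n \to \infty} E_0 \Pi(\|f - f_0\|_{2,\rho} > M\epsilon_n \mid Y, X) = 0. \tag{19}$$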

Proof. The quantity in (19) is bounded by $T_{1n} + T_{2n}$ from Theorem 3.1. Invoking Theorem
3.3,

$$T_{1n} \lesssim \frac{\max\{k_n, \|\theta_0^k\|_{\mathbb{H}}^2\}}{n\epsilon_n^2} + \frac{\|f_0 - f_0^k\|_{2,\rho}^2}{\epsilon_n^2} + P(\chi^2_{k_n} > M^2 n \epsilon_n^2/4) + P_X(A_n^c).$$

The first two quantities in the above display converge to zero by (C1) and (C2). By (C1)
and a standard deviation inequality for chi-square distributions, $P(\chi^2_{k_n} > M^2 n \epsilon_n^2/4) \to 0$
for $M > 2$. By Lemma 3.2, $P_X(A_n^c) \le k_n e^{-Cn/(k_n L_n^2)} \to 0$ by (C0). The proof is completed
using the bound (17) for $T_{2n}$. □
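
The proof does not name the chi-square deviation inequality it relies on. One bound that would suffice is the Laurent-Massart inequality; this is our assumption and not necessarily the inequality the authors had in mind. Assuming (C1) gives $k_n = o(n\epsilon_n^2)$, a sketch of the step is:

```latex
% Laurent--Massart: for any x > 0,
\[
P\left( \chi^2_k \ge k + 2\sqrt{kx} + 2x \right) \le e^{-x}.
\]
% Take x = n\epsilon_n^2 / 2. Since k_n = o(n\epsilon_n^2),
\[
k_n + 2\sqrt{k_n x} + 2x = n\epsilon_n^2 \,(1 + o(1)).
\]
% For M > 2 we have M^2/4 > 1, so for all sufficiently large n
\[
P\left( \chi^2_{k_n} > M^2 n \epsilon_n^2 / 4 \right)
\le P\left( \chi^2_{k_n} \ge k_n + 2\sqrt{k_n x} + 2x \right)
\le e^{-n\epsilon_n^2/2} \to 0,
\]
% where the final limit uses n\epsilon_n^2 \to \infty.
```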
In (18), the prior tail probability in the numerator is
$\Pi(\|f - f^k\|_{2,\rho}^2 > M^2\epsilon_n^2) = P(\sum_{j=k_n+1}^{\infty} \lambda_j Z_j^2 > M^2\epsilon_n^2)$,
with $Z_j$s i.i.d. $N(0, 1)$. Using a version of Bernstein’s inequality for sub-exponential
random variables (Proposition 5.16 of [33]), one can suitably bound this probability.
Second, the prior concentration in $L^2_\rho$ norm in the denominator is
$\Pi(\|f - f_0\|_{2,\rho} \le \tilde\epsilon_n) = \Pi(\|\theta - \theta_0\|_{\ell_2} \le \tilde\epsilon_n)$ with $\theta_j \sim N(0, \lambda_j)$;
this can be bounded from below using Anderson’s inequality (Lemma B.2 in the Appendix).
We provide specific illustrations of these
arguments for the squared-exponential kernel below.

4 Application to the squared-exponential kernel 

As a non-trivial application of the general results in the previous section, we consider
Gaussian process regression with a squared-exponential kernel $K_a(x, x') = \exp(-a^2\|x - x'\|^2)$,
a popular choice in machine learning applications. It is well-known that the realizations
of a GP with squared-exponential kernel are infinitely smooth and hence are not suitable
to model rougher functions. It has only been recently understood [28] that the parameter
$a$ plays the role of an “inverse-bandwidth”, and scaling the parameter $a$ with the sample
size enables better approximation of rougher functions. [28] motivates this from a rescaling
perspective; choosing a large value of $a$ is equivalent to tracing the trajectory of a smooth
process (with $a = 1$) over a larger domain, incurring more roughness. In the regression
context (1), [28] derived optimal posterior convergence rates in the empirical $L_2$ norm
using a rescaling $a = a_n = n^{1/(2\alpha+1)}$, where the true function is $\alpha$-smooth on a compact
domain in $\mathbb{R}$. Using a gamma prior on $a$, [31] extended their result showing that the rate
of contraction is adaptive over any $\alpha$-smooth compactly supported function. In a more
recent article, [22] extended the results in [28] to the integrated $L_2$ norm. All these articles
make exclusive use of the reproducing kernel Hilbert space theory from [30] and bounds on
sup-norm small-ball probabilities of Gaussian processes over compact domains [19, 20, 21].

The eigen-expansion of the squared-exponential kernel offers a complementary
perspective into the rescaling phenomenon. Consider the expression for the eigenvalues of the
squared-exponential kernel in (6). It is well known that the rate of decay of the eigenvalues
is closely connected to the smoothness of the process (7). When $a = 1$, the eigenvalues
$\lambda_j$ decay exponentially fast in $j$, indicating the infinite smoothness of the sample paths.
Although the rate of decay remains exponential in $j$ for any fixed value of $a$, it is effectively
slowed down for large values of $a$; see Figure 1 for an illustration.
Figure 1: The top 50 eigenvalues $\lambda_j$ of the squared-exponential kernel in (6) plotted against
the index $j$ for 4 different values of $a$. Left panel: the index $j$ runs from 1 to 20. Right
panel: $j$ runs from 21 to 50. With increasing $a$, the rate of decay is slowed down.

In this section, we apply the results developed in Section 3 (specifically Corollary 3.5)
to derive posterior rates of contraction for the above rescaled GP priors with the covariates
drawn i.i.d. from a Gaussian density on the real line. To the best of our knowledge, no existing
posterior contraction rate result for the squared-exponential (or other) kernel allows
unbounded covariate support. Using a tensor-product basis approach, it is possible to extend
our results to covariates in $\mathbb{R}^d$.

4.1 Posterior contraction rates 

For the remainder of this section, $\{\phi_j\}$ and $\{\lambda_j\}$ denote the eigenfunctions and eigenvalues
(6) of the squared-exponential kernel with inverse-bandwidth parameter $a$; the dependence
on $a$ is suppressed for notational convenience. In order to apply Corollary 3.5 to the
squared-exponential kernel, we need sup-norm bounds on the $k_n$ leading eigenfunctions
$\phi_j$. Since we are concerned with rescaled processes where the parameter $a$ is sample-size
dependent, it is important to precisely characterize the role of $a$ in the bound.

A well-known inequality for the Hermite polynomials is Cramér’s bound [26], which
states that for any $l \ge 1$, $|H_l(z)| \le C\sqrt{2^l\, l!}\; e^{z^2/2}$ for all $z \in \mathbb{R}$, where $C < 2$ is a global
constant which doesn’t depend on $z$ or $l$. A direct use of this bound leads to
$|\phi_{j+1}(x)| \le C (c/b)^{1/4} e^{b x^2}$, which is clearly not sufficient as we are dealing with unbounded covariates.
Since the Hermite functions are polynomials, the exponential bound provided by Cramér’s
inequality is wasteful in the tails. We derive a bound for the leading eigenfunctions $\phi_j$ in
Lemma 4.1 below; refer to the Appendix for a proof. We did not find an existing reference
proving this result. The main idea is to use Cramér’s bound in a neighborhood of the origin,
while for suitably large values of $x$, use a combination of Cramér’s bound with a different
bound obtained by exploiting an integral representation of the Hermite polynomials.
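
Cramér’s bound can be checked numerically on a grid. The sketch below assumes the bound has the form $|H_l(z)| \le C\sqrt{2^l\, l!}\, e^{z^2/2}$ and uses NumPy’s `numpy.polynomial.hermite.hermval`, which evaluates series in the physicist’s Hermite polynomials; the helper name `cramer_ratio` and the grid are illustrative choices.

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermval  # physicist's Hermite series

def cramer_ratio(l, z):
    """|H_l(z)| / (sqrt(2^l l!) * exp(z^2 / 2)) for the physicist's H_l."""
    coef = np.zeros(l + 1)
    coef[l] = 1.0  # series with a single term selects H_l
    norm = math.sqrt(2.0 ** l * math.factorial(l))
    return np.abs(hermval(z, coef)) * np.exp(-z ** 2 / 2.0) / norm

z = np.linspace(-10.0, 10.0, 4001)
# Largest ratio over the grid and over degrees 1..20; Cramer's bound says it stays below C < 2.
worst = max(cramer_ratio(l, z).max() for l in range(1, 21))
```

The ratio stays bounded only because of the $e^{z^2/2}$ factor, which illustrates why the bound is loose far from the origin, where $|H_l(z)|$ grows only polynomially.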

Lemma 4.1. Let $\phi_j$s be the eigenfunctions of the squared-exponential kernel as in (6).
Then, $\max_{0 \le j \le k} \sup_{x \in \mathbb{R}} |\phi_{j+1}(x)| \lesssim \ldots$ for large $a$.

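We are now in a position to state the rate theorem. Set $a_n = n^{1/(2\alpha+1)}$ in (6). We
define the true class of functions $\mathcal{F}$ with “smoothness $\alpha$” as linear combinations of the
eigenfunctions $\phi_j$ with the coefficient vector in the Sobolev class $\Theta_\alpha$. Formally,

$$\mathcal{F} = \left\{ f_0 : f_0 = \sum_{j=1}^{\infty} \theta_{0j} \phi_j, \ \theta_0 = (\theta_{01}, \theta_{02}, \ldots) \in \Theta_\alpha \right\}. \tag{20}$$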

i=i 

Theorem 4.2. Consider the nonparametric regression model (1). Assume the covariates
$X_i$ are drawn i.i.d. from a Gaussian density $\rho(x) = \sqrt{2b/\pi}\, e^{-2bx^2}$ and the true function
$f_0 \in \mathcal{F}$ as in (20) with $\alpha > 1/\{4(1 - 2b)\}$. Let $f \sim GP(0, \sigma^2 K_a)$ with squared-exponential
covariance kernel $K_a(x, x') = \exp(-a^2|x - x'|^2)$. Choose $a = a_n = n^{1/(2\alpha+1)}$. Then, an
upper bound to the posterior contraction rate (19) in the $\|\cdot\|_{2,\rho}$ norm is $\epsilon_n = n^{-\alpha/(2\alpha+1)} \log n$.

Remark 4.1. From [28], the rescaling $a_n = n^{1/(2\alpha+1)}$ is the optimal choice for $\alpha$-smooth
functions on a compact domain and leads to the optimal rate $n^{-\alpha/(2\alpha+1)}$ up to a logarithmic
term. Theorem 4.2 obtains a similar result for non-compact domains in a random design
setting. The lower bound on the smoothness $\alpha$ is typically necessitated in random design
settings; see for example, [6, 8]. In particular, when $b = 1/4$, so that $\rho$ corresponds to the
standard normal density, we require $\alpha > 1/\{4(1 - 2b)\} = 1/2$.

5 Proof of main results 

Proof of Theorem 3.1 

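Using the triangle inequality $\|f - f_0\|_{2,\rho} \le \|f^k - f_0^k\|_{2,\rho} + \|f - f^k\|_{2,\rho} + \|f_0 - f_0^k\|_{2,\rho}$, and since
$\|f_0 - f_0^k\|_{2,\rho} \lesssim \epsilon_n$ by assumption, we can bound
$\Pi(\|f - f_0\|_{2,\rho} > M\epsilon_n \mid Y, X) \le \Pi(\|f^k - f_0^k\|_{2,\rho} > M\epsilon_n \mid Y, X) + \Pi(\|f - f^k\|_{2,\rho} > M\epsilon_n \mid Y, X)$.
Further, using the orthonormality of the eigenfunctions,
$\Pi(\|f^k - f_0^k\|_{2,\rho} > M\epsilon_n \mid Y, X) = \mathcal{W}^k(\|\theta^k - \theta_0^k\| > M\epsilon_n \mid Y, X)$.
Therefore, taking expectation,

$$E_0 \Pi(\|f - f_0\|_{2,\rho} > M\epsilon_n \mid Y, X) \le E_0 \mathcal{W}^k(\|\theta^k - \theta_0^k\| > M\epsilon_n \mid Y, X) + T_{2n}. \tag{21}$$

Let $U_n = \{\|\theta^k - \theta_0^k\| \le M\epsilon_n\}$. We shall show below that $E_0 \mathcal{W}^k(U_n^c \mid Y, X) \le T_{1n}$,
which will complete the proof of the theorem. For any $A_n \subset \mathcal{X}^n$ in the $\sigma$-field generated
by $X_1, \ldots, X_n$, bound

$$E_0 \mathcal{W}^k(U_n^c \mid Y, X) \le E_0\left[ \mathcal{W}^k(U_n^c \mid Y, X)\, 1_{A_n}(X) \right] + P_X(A_n^c). \tag{22}$$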

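We now elucidate a coupling argument to bound the $\mathcal{W}^k(U_n^c \mid Y, X)$ term in (22). Given
$(Y, X)$, let $(\theta_T, \theta_A) \in \mathbb{R}^{k_n} \times \mathbb{R}^{k_n}$ be a pair of random variables such that
$\theta_T \sim Q_T = \mathcal{W}^k(\cdot \mid Y, X)$, $\theta_A \sim Q_A = N(\theta_0^k, \sigma^2 I_{k_n}/n)$ and
$E\|\theta_T - \theta_A\|^2 = d^2_{W,2}(Q_T, Q_A)$, where $E$ denotes an
expectation with respect to the joint distribution of $(\theta_T, \theta_A)$ given $Y, X$. In other words,
$(\theta_T, \theta_A) \in \mathrm{joint}(Q_T, Q_A)$ are optimally coupled, i.e., the infimum in (2) is attained by
$(\theta_T, \theta_A)$. Such an optimal coupling can always be constructed in general; see [17] for a
constructive proof for normal distributions. We then have

$$\mathcal{W}^k(U_n^c \mid Y, X) = P(\theta_T \in U_n^c)$$

$$\le P(\theta_T \in U_n^c,\ \|\theta_T - \theta_A\| \le M\epsilon_n/2) + P(\|\theta_T - \theta_A\| > M\epsilon_n/2) \tag{23}$$

$$\le P(\|\theta_A - \theta_0^k\| > M\epsilon_n/2) + \frac{4E\|\theta_T - \theta_A\|^2}{M^2\epsilon_n^2}$$

$$= P(\chi^2_{k_n} > M^2 n \epsilon_n^2/4) + \frac{4\, d^2_{W,2}(Q_T, Q_A)}{M^2\epsilon_n^2}. \tag{24}$$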

In the above display, the first line simply uses that the marginal distribution of $\theta_T$ is
$\mathcal{W}^k(\cdot \mid Y, X)$ by construction. From the first to the second line (23), we use a union bound.
For the first term in (23), we first use the triangle inequality to conclude that if $\theta_T \in U_n^c$, i.e.,
$\|\theta_T - \theta_0^k\| > M\epsilon_n$, and $\|\theta_T - \theta_A\| \le M\epsilon_n/2$, then
$\|\theta_A - \theta_0^k\| \ge \|\theta_T - \theta_0^k\| - \|\theta_T - \theta_A\| > M\epsilon_n/2$.
Next, by construction, $(\theta_A - \theta_0^k) \mid Y, X \sim N(0, (\sigma^2/n) I_{k_n})$, which implies
$P(\|\theta_A - \theta_0^k\| > M\epsilon_n/2) = P(\chi^2_{k_n} > M^2 n \epsilon_n^2/4)$. For the $P(\|\theta_T - \theta_A\| > M\epsilon_n/2)$ term in (23), we first use
Markov’s inequality, and then exploit the fact that $(\theta_T, \theta_A)$ are “optimally coupled”, i.e.,
$E\|\theta_T - \theta_A\|^2 = d^2_{W,2}(Q_T, Q_A)$. This leaves us at (24). Finally, substituting the bound (24)
in (22), we have

$$E_0 \mathcal{W}^k(U_n^c \mid Y, X) \le \frac{E_0\left[ 1_{A_n}(X)\, d^2_{W,2}\left( \mathcal{W}^k(\cdot \mid Y, X),\, N_{k_n}(\theta_0^k, \sigma^2 I_{k_n}/n) \right) \right]}{M^2\epsilon_n^2/4} + P(\chi^2_{k_n} > M^2 n \epsilon_n^2/4) + P_X(A_n^c).$$

The quantity on the right hand side in the above display is $T_{1n}$, and the theorem is proved.


Proof of Lemma 3.2 and Remark 3.1

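We make use of the following version of a matrix Bernstein inequality from [27]: let $Z_i$, $i = 1, \ldots, n$, be a sequence of independent self-adjoint $d \times d$ matrices with $E Z_i = 0$ and $\|Z_i\|_2 \le B$ almost surely for some $B > 0$. Let $\tau^2 = \|\sum_{i=1}^n E Z_i^2\|_2$. Then, for any $t > 0$,

$$P\Big(\Big\|\sum_{i=1}^n Z_i\Big\|_2 \ge t\Big) \le d \exp\Big(-\frac{t^2/2}{\tau^2 + Bt/3}\Big). \qquad (25)$$

Set $\phi^{(i)} = (\phi_j(X_i))_{1 \le j \le k_n} \in \mathbb{R}^{k_n}$ and $Z_i = \phi^{(i)}(\phi^{(i)})^T - I_{k_n}$, so that $\sum_{i=1}^n Z_i = \Phi^T\Phi - nI_{k_n}$.

The $Z_i$s are independent symmetric matrices with $E_X Z_i = 0$, since from the orthonormality of the eigenfunctions, $E\phi_j(X)\phi_l(X) = \int \phi_j(x)\phi_l(x)\rho(x)\,dx = \delta_{jl}$. We also have $\|Z_i\|_2 \le 1 + \|\phi^{(i)}\|^2 = 1 + \sum_{j=1}^{k_n}\phi_j^2(X_i) \le 1 + k_n L_{k_n}^2$. Therefore, the conditions for applying (25) are satisfied.

We have $Z_i^2 = \|\phi^{(i)}\|^2\phi^{(i)}(\phi^{(i)})^T - 2\phi^{(i)}(\phi^{(i)})^T + I_{k_n} \preceq k_n L_{k_n}^2\,\phi^{(i)}(\phi^{(i)})^T + I_{k_n}$, so that $\|E Z_i^2\| \lesssim k_n L_{k_n}^2$ and hence by the triangle inequality, $\tau^2 \lesssim n k_n L_{k_n}^2$. Substituting $t = n/2$ and $B = k_n L_{k_n}^2$ in (25), we have

$$P_X(\|\Phi^T\Phi - nI_{k_n}\| > n/2) \le k_n \exp\Big(-\frac{n^2/8}{\tau^2 + Bn/6}\Big) \le k_n \exp\Big(-\frac{n}{8Ck_n L_{k_n}^2}\Big),$$

since $\tau^2 + Bn/6 \le n k_n L_{k_n}^2 + n k_n L_{k_n}^2/6 \le C n k_n L_{k_n}^2$ and $x \mapsto e^{-a/x}$ is increasing in $x$ for any $a > 0$.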

Remark 3.1 follows, since on $A_n$,

(i) using the triangle inequality, $\|\Phi^T\Phi\|_2 \le 3n/2$.

(ii) using Lemma B.1 (ii), $s_{\min}(\Phi^T\Phi) \ge n - \|\Phi^T\Phi - nI_{k_n}\| \ge n/2$.

(iii) using (i) and (ii), $\kappa(\Phi^T\Phi) \le 3$.

(iv) $\mathrm{tr}[(\Phi^T\Phi)^{-1}] \le k_n\|(\Phi^T\Phi)^{-1}\|_2 = k_n/s_{\min}(\Phi^T\Phi) \le 2k_n/n$.
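The four consequences listed in Remark 3.1 can be checked numerically on any design matrix whose Gram matrix satisfies $\|\Phi^T\Phi - nI_{k_n}\| \le n/2$. The sketch below is illustrative only: the construction of `phi` (a synthetic matrix with prescribed Gram eigenvalues) and all function names are assumptions made for the example, not part of the paper.

```python
import numpy as np


def design_with_controlled_gram(n, k, spread=0.4, seed=0):
    """Synthetic n x k matrix whose Gram matrix has eigenvalues in
    [n(1 - spread), n(1 + spread)], so that ||Phi^T Phi - n I|| <= spread * n."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, k)))  # orthonormal columns
    v, _ = np.linalg.qr(rng.standard_normal((k, k)))  # random rotation
    eig = n * rng.uniform(1.0 - spread, 1.0 + spread, size=k)
    root = (v * np.sqrt(eig)) @ v.T                   # symmetric square root
    return q @ root                                   # Gram = v diag(eig) v^T


def remark_quantities(phi):
    """Quantities appearing in items (i)-(iv) of the remark."""
    n, k = phi.shape
    gram = phi.T @ phi
    svals = np.linalg.svd(gram, compute_uv=False)     # sorted, largest first
    return {
        "deviation": np.linalg.norm(gram - n * np.eye(k), 2),
        "norm": svals[0],
        "s_min": svals[-1],
        "cond": svals[0] / svals[-1],
        "trace_inv": np.trace(np.linalg.inv(gram)),
    }


if __name__ == "__main__":
    print(remark_quantities(design_with_controlled_gram(200, 5)))
```

With `spread` at most 0.5 the event $A_n$ holds by construction, so each of the bounds (i) to (iv) should be visible in the printed values.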

Proof of Theorem 3.3 

Given $Y, X$, recall that $Q_T$ and $Q_A$ respectively denote the probability measures $W^*(\cdot \mid Y, X) = N_{k_n}(\hat\theta, \Sigma)$ and $N_{k_n}(\theta_0^*, \sigma^2 I_{k_n}/n)$. By the tower property of conditional expectation,

$$E_0\big[1_{A_n}(X)\, d_{W,2}^2(Q_T, Q_A)\big] = E_X\big[1_{A_n}(X)\, E_{0|X}\, d_{W,2}^2(Q_T, Q_A)\big]. \qquad (26)$$

Since $\Sigma$ and $\sigma^2 I_{k_n}/n$ (trivially) commute, apply (3) to write

$$d_{W,2}^2(Q_T, Q_A) = \|\hat\theta - \theta_0^*\|^2 + \|\Sigma^{1/2} - (\sigma/\sqrt{n})\, I_{k_n}\|_F^2. \qquad (27)$$

Thus,

$$E_{0|X}\, d_{W,2}^2(Q_T, Q_A) = E_{0|X}\|\hat\theta - \theta_0^*\|^2 + \|\Sigma^{1/2} - (\sigma/\sqrt{n})\, I_{k_n}\|_F^2, \qquad (28)$$

since the second term does not involve $Y$. We now proceed to bound each of these two terms in (28) on the set $A_n$. To that end, we shall apply Lemma 3.2 and, in particular, the consequences of Lemma 3.2 summarized in Remark 3.1 multiple times below. We also make use of Lemma B.1 on multiple occasions.
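The closed form used in (27) can be compared against the general expression for the squared 2-Wasserstein distance between Gaussians, $\|m_1 - m_2\|^2 + \mathrm{tr}\{S_1 + S_2 - 2(S_2^{1/2} S_1 S_2^{1/2})^{1/2}\}$. The sketch below is a hedged illustration: the helper names are hypothetical, and the second covariance is taken to be a multiple of the identity so that the two covariances commute, as in the proof.

```python
import numpy as np


def sym_sqrt(mat):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def w2_squared_general(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between N(m1, s1) and N(m2, s2)."""
    root2 = sym_sqrt(s2)
    cross = sym_sqrt(root2 @ s1 @ root2)
    return float(np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * cross))


def w2_squared_commuting(m1, s1, m2, s2):
    """Simplified form, valid when s1 and s2 commute."""
    gap = sym_sqrt(s1) - sym_sqrt(s2)
    return float(np.sum((m1 - m2) ** 2) + np.linalg.norm(gap, "fro") ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    k, n, sigma = 4, 100, 1.0
    a = rng.standard_normal((k, k))
    s1 = a @ a.T / n + np.eye(k) / n      # arbitrary covariance (assumed)
    s2 = sigma**2 * np.eye(k) / n         # sigma^2 I / n commutes with s1
    m1, m2 = rng.standard_normal(k), rng.standard_normal(k)
    print(w2_squared_general(m1, s1, m2, s2), w2_squared_commuting(m1, s1, m2, s2))
```

The two printed values should agree up to rounding, which is the simplification exploited in (27) and (28).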

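Recall $\hat\theta = (\Phi^T\Phi + \Lambda^{-1})^{-1}\Phi^T Y$ and define $\theta_Y = (\Phi^T\Phi)^{-1}\Phi^T Y$. Using $\|a + b\|^2 \le 2(\|a\|^2 + \|b\|^2)$, bound

$$E_{0|X}\|\hat\theta - \theta_0^*\|^2 \lesssim E_{0|X}\|\hat\theta - \theta_Y\|^2 + E_{0|X}\|\theta_Y - \theta_0^*\|^2. \qquad (29)$$

Let us first deal with $E_{0|X}\|\theta_Y - \theta_0^*\|^2$. Let $F_0 = (f_0(X_1), \ldots, f_0(X_n))^T$, so that $F_0 = E_{0|X} Y$. By (T1), we can write $F_0 = \Phi\theta_0^* + R$, where $R = (R_1, \ldots, R_n)^T$ with $R_i = f_0(X_i) - f_0^*(X_i)$. Write $E_{0|X}\|\theta_Y - \theta_0^*\|^2 = E_{0|X}\|\theta_Y - E_{0|X}\theta_Y\|^2 + \|E_{0|X}\theta_Y - \theta_0^*\|^2$. The first term is $E_{0|X}\|\theta_Y - E_{0|X}\theta_Y\|^2 = \sigma^2\,\mathrm{tr}[(\Phi^T\Phi)^{-1}] \lesssim \sigma^2 k_n/n$ on $A_n$. For the second term, write $E_{0|X}\theta_Y = (\Phi^T\Phi)^{-1}\Phi^T F_0 = \theta_0^* + (\Phi^T\Phi)^{-1}\Phi^T R$, so that $\|E_{0|X}\theta_Y - \theta_0^*\|^2 = \|(\Phi^T\Phi)^{-1}\Phi^T R\|^2 \le \|(\Phi^T\Phi)^{-1}\Phi^T\|_2^2\,\|R\|^2 \le \{\|\Phi^T\|_2/s_{\min}(\Phi^T\Phi)\}^2\,\|R\|^2$, and the quantity $\|\Phi^T\|_2/s_{\min}(\Phi^T\Phi)$ is bounded above by a constant multiple of $1/\sqrt{n}$ on $A_n$. Therefore,

$$1_{A_n}(X)\, E_{0|X}\|\theta_Y - \theta_0^*\|^2 \lesssim \sigma^2 k_n/n + \|R\|^2/n. \qquad (30)$$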

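We now handle the $E_{0|X}\|\hat\theta - \theta_Y\|^2$ term in (29). Using $A_1^{-1} - A_2^{-1} = A_1^{-1}(A_2 - A_1)A_2^{-1}$, write $\theta_Y - \hat\theta = BY$, where $B = (\Phi^T\Phi)^{-1}\Delta\Phi^T$ with $\Delta = \Lambda^{-1}(\Phi^T\Phi + \Lambda^{-1})^{-1}$. Using a standard result for expectations of quadratic forms,

$$E_{0|X}\|\hat\theta - \theta_Y\|^2 = \|BF_0\|^2 + \sigma^2\|B\|_F^2. \qquad (31)$$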
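The resolvent identity behind $\theta_Y - \hat\theta = BY$ and the quadratic-form expectation in (31) are easy to confirm numerically. The following is a minimal sketch with made-up inputs: the design matrix, the prior eigenvalues `lam`, and the function names are assumptions for illustration only.

```python
import numpy as np


def shrinkage_gap_matrix(phi, lam):
    """B = (Phi^T Phi)^{-1} Delta Phi^T with Delta = Lam^{-1} (Phi^T Phi + Lam^{-1})^{-1}."""
    gram = phi.T @ phi
    lam_inv = np.diag(1.0 / lam)
    delta = lam_inv @ np.linalg.inv(gram + lam_inv)
    return np.linalg.inv(gram) @ delta @ phi.T


def estimators(phi, lam, y):
    """Posterior-mean-type estimator theta_hat and least squares estimator theta_y."""
    gram = phi.T @ phi
    theta_hat = np.linalg.solve(gram + np.diag(1.0 / lam), phi.T @ y)
    theta_y = np.linalg.solve(gram, phi.T @ y)
    return theta_hat, theta_y


def expected_sq_gap(b, f0, sigma):
    """E ||B Y||^2 for Y ~ N(f0, sigma^2 I): ||B f0||^2 + sigma^2 ||B||_F^2."""
    return float(np.sum((b @ f0) ** 2) + sigma**2 * np.linalg.norm(b, "fro") ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = rng.standard_normal((50, 4))
    lam = np.array([1.0, 0.5, 0.25, 0.125])
    y = rng.standard_normal(50)
    b = shrinkage_gap_matrix(phi, lam)
    theta_hat, theta_y = estimators(phi, lam, y)
    print(np.max(np.abs(b @ y - (theta_y - theta_hat))))
```

The printed discrepancy should be at the level of floating-point rounding.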

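So the goal now is to bound each one of $\|BF_0\|^2$ and $\|B\|_F^2$ on $A_n$. We have $BF_0 = (\Phi^T\Phi)^{-1}\Delta(\Phi^T\Phi)\theta_0^* + (\Phi^T\Phi)^{-1}\Delta\Phi^T R = (\Phi^T\Phi)^{-1}\Lambda^{-1}\tilde\theta_0 + (\Phi^T\Phi)^{-1}\Delta\Phi^T R$, where $\tilde\theta_0 = (\Phi^T\Phi + \Lambda^{-1})^{-1}(\Phi^T\Phi)\theta_0^*$. Bound $\|BF_0\|^2 \le 2\big(\|(\Phi^T\Phi)^{-1}\Lambda^{-1}\tilde\theta_0\|^2 + \|(\Phi^T\Phi)^{-1}\Delta\Phi^T R\|^2\big)$. Bound $\|\Delta\|_2 \le \|\Lambda^{-1}\|_2/\{s_{\min}(\Phi^T\Phi) - \|\Lambda^{-1}\|_2\} \le 1$ on $A_n$, since $\|\Lambda^{-1}\|_2 \le n/4$ by (A1). Therefore, $\|(\Phi^T\Phi)^{-1}\Delta\Phi^T R\| \le \{\|\Phi^T\|/s_{\min}(\Phi^T\Phi)\}\,\|R\| \lesssim \|R\|/\sqrt{n}$ on $A_n$, using an argument as in the paragraph after the display (29). Next, $\|(\Phi^T\Phi)^{-1}\Lambda^{-1}\tilde\theta_0\| \le \{\|\Lambda^{-1/2}\|/s_{\min}(\Phi^T\Phi)\}\,\|\Lambda^{-1/2}\tilde\theta_0\| \lesssim \|\Lambda^{-1/2}\tilde\theta_0\|/\sqrt{n}$ on $A_n$. After some manipulation, we can write $\tilde\theta_0 = \theta_0^* - \Delta^T\theta_0^*$, so that

$$
\begin{aligned}
\|\Lambda^{-1/2}\tilde\theta_0\| &\le \|\Lambda^{-1/2}\theta_0^*\| + \|\Lambda^{-1/2}\Delta^T\theta_0^*\| \\
&\le \|\Lambda^{-1/2}\theta_0^*\| + \|\Lambda^{-1/2}(\Phi^T\Phi + \Lambda^{-1})^{-1}\Lambda^{-1/2}\|_2\,\|\Lambda^{-1/2}\theta_0^*\| \lesssim \|\Lambda^{-1/2}\theta_0^*\| = \|\theta_0^*\|_H,
\end{aligned}
$$

since $\|\Lambda^{-1/2}(\Phi^T\Phi + \Lambda^{-1})^{-1}\Lambda^{-1/2}\|_2 \le \|\Delta\|_2$, which we already know is $\le 1$ on $A_n$. Thus, we conclude that $\|BF_0\|^2 \lesssim \|R\|^2/n + \|\theta_0^*\|_H^2/n$. Finally,

$$\|B\|_F^2 \le k_n\,\|\Delta\|_2^2\,\|\Phi^T\|_2^2/s_{\min}^2(\Phi^T\Phi) \lesssim k_n/n$$

on $A_n$, since we have already shown that $\|\Delta\|_2 \le 1$ and $\|\Phi^T\|_2/s_{\min}(\Phi^T\Phi) \lesssim 1/\sqrt{n}$ on $A_n$.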
Substituting all the inequalities in (31), 


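$$1_{A_n}(X)\, E_{0|X}\|\hat\theta - \theta_Y\|^2 \lesssim \frac{\|R\|^2}{n} + \frac{\|\theta_0^*\|_H^2}{n} + \frac{\sigma^2 k_n}{n}. \qquad (32)$$

Substituting the inequalities obtained in (30) and (32) in (29),

$$1_{A_n}(X)\, E_{0|X}\|\hat\theta - \theta_0^*\|^2 \lesssim \frac{\|R\|^2}{n} + \frac{\|\theta_0^*\|_H^2}{n} + \frac{\sigma^2 k_n}{n}. \qquad (33)$$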


Now we consider the term $\|\Sigma^{1/2} - (\sigma/\sqrt{n})\, I_{k_n}\|_F^2$ in (28). Recalling the expression of $\Sigma$, $\Sigma^{1/2} = \sigma(\Phi^T\Phi + \Lambda^{-1})^{-1/2}$, and since $n/4 \le s_{\min}(\Phi^T\Phi + \Lambda^{-1}) \le \|\Phi^T\Phi + \Lambda^{-1}\| \le 2n$ on $A_n$, all eigenvalues of $\Sigma^{1/2}$ are of the form $C\sigma/\sqrt{n}$ on $A_n$. Since the squared Frobenius norm of a symmetric matrix is the sum of its squared eigenvalues, we conclude that $\|\Sigma^{1/2} - (\sigma/\sqrt{n})\, I_{k_n}\|_F^2 \lesssim \sigma^2 k_n/n$ on $A_n$. This, in conjunction with (33), when substituted in (28) yields

$$1_{A_n}(X)\, E_{0|X}\, d_{W,2}^2(Q_T, Q_A) \lesssim \frac{\sigma^2 k_n}{n} + \frac{\|\theta_0^*\|_H^2}{n} + \frac{\|R\|^2}{n}. \qquad (34)$$

Recall from (26) that our objective is to bound the $E_X$ expectation of the left hand side of (34). The only term depending on $X$ in the right hand side of (34) is $\|R\|^2$, and $E_X\|R\|^2 = n\|f_0 - f_0^*\|_{2,\rho}^2$. Therefore, taking an expectation with respect to $E_X$ on both sides of (34), the conclusion follows.
Proof of Theorem 3.4

Let

$$D_n = \int \frac{p_{n,F}(Y)}{p_{n,F_0}(Y)}\,\Pi(df), \qquad G_n = \int \log\frac{p_{n,F}(Y)}{p_{n,F_0}(Y)}\,\Pi(df).$$

Following a standard argument, it is enough to show the desired lower bound on $P_0(D_n \ge e^{-n\epsilon_n^2})$ for any probability measure $\Pi$ supported on $\mathcal{F}_n = \{f : \|f - f_0\|_{2,\rho} \le \epsilon_n\}$. By Jensen's inequality, $\log D_n \ge G_n$, so that $P_0(D_n \ge e^{-n\epsilon_n^2}) \ge P_0(G_n \ge -n\epsilon_n^2)$. Our goal below is to bound $P_0(G_n \ge -n\epsilon_n^2)$ from below, or equivalently, bound $P_0(G_n < -n\epsilon_n^2)$ from above.

A simple calculation yields $G_n = \mu_{0X}^T(Y - F_0) - \sigma_{0X}^2/2$, where $\mu_{0X} = \int (F - F_0)\,\Pi(df) \in \mathbb{R}^n$ and $\sigma_{0X}^2 = \int \|F - F_0\|^2\,\Pi(df)$. Since $Y \sim N(F_0, I_n)$, we have $G_n \mid X \sim N(-\sigma_{0X}^2/2, \|\mu_{0X}\|^2)$. Also, the marginal expectation of $G_n$ is $E_0 G_n = -E_X\sigma_{0X}^2/2 = -n\sigma_0^2/2$, where $\sigma_0^2 = \int \|f - f_0\|_{2,\rho}^2\,\Pi(df)$. Since $\Pi$ is supported on $\mathcal{F}_n$, clearly $\sigma_0^2 \le \epsilon_n^2$.

The Paley-Zygmund inequality (see, for example, [11]) states that for any non-negative random variable $Z$ with finite second moment and $\delta \in (0,1)$, $P(Z > \delta EZ) \ge (1 - \delta)^2 (EZ)^2/(EZ^2)$. In particular, if $(EZ)^2/(EZ^2) \ge 1 - \gamma$ for $\gamma > 0$ small, then

$$P(Z \le \delta EZ) \le 1 - (1 - \delta)^2(1 - \gamma) \lesssim \delta + \gamma. \qquad (35)$$
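To make the Paley-Zygmund step concrete, the inequality can be evaluated exactly for a finitely supported non-negative random variable. The sketch below is a toy illustration; the function name and the example distributions are hypothetical and not taken from the paper.

```python
def paley_zygmund_terms(values, probs, delta):
    """Exact P(Z > delta * E Z) and the Paley-Zygmund lower bound
    (1 - delta)^2 (E Z)^2 / E Z^2 for a discrete non-negative Z."""
    mean = sum(p * v for v, p in zip(values, probs))
    second = sum(p * v * v for v, p in zip(values, probs))
    tail = sum(p for v, p in zip(values, probs) if v > delta * mean)
    bound = (1.0 - delta) ** 2 * mean**2 / second
    return tail, bound


if __name__ == "__main__":
    # A variable concentrated near its mean: ratio (E Z)^2 / E Z^2 is close to 1.
    print(paley_zygmund_terms([0.9, 1.0, 1.1], [0.25, 0.5, 0.25], delta=0.2))
    # A heavy-tailed variable: the bound is weaker but still valid.
    print(paley_zygmund_terms([0.0, 1.0, 10.0], [0.5, 0.4, 0.1], delta=0.5))
```

When the ratio $(EZ)^2/(EZ^2)$ is close to one, as the proof arranges for $Z_n$, the lower bound is close to $(1 - \delta)^2$.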


We shall invoke (35) with the non-negative random variable Zn = for some tn G 

(0,1/2) and Sn G (0,1) to be chosen below. A key ingredient of such an exercise is to 
obtain a lower bound on (Eo.Z„) 2 /(EoA 2 ). 

By Jensen’s inequality, EqA^ > which implies (Eo.Z„)2 > 

We next need to bound EqA^ = from above. Since Gn \ X is conditionally Gaus¬ 

sian, we have sufficient control over the moment generating function Mg„{X) = 
for A G (0,1). Using the iterative property of conditional expectations, we can write 
Ege^'^" = Eox[IEo|x(e^‘^")]- Recalling Gn | A ~ N(-<Tojf/2, ||/Uox|f), we have 

Eo|x(e^‘^") = e-AxG G < ^ 


where the second step follows since by an application of Cauchy-Schwartz inequality, 
lll^oxll^ < Ugjf. Since A G (0,1), the quantity A —A^ in the exponent is positive. Therefore, 
by Jensen’s inequality, Eqc'^^" < G = q-G-G)AG^ In particular, for any 
tn G (0,1/2), 'EqZ^ = Combining this bound with the previously 

obtained bound (Eo^n)^ > have (Eo^n)^/(IEo^^) > 

For a slowly decaying sequence 7 „ satisfying 7 n —)• 0 and ^nne^ —)• oo, set t^ne^ = 7 ^. 
For n large enough so that 7 „ < 1, we have (Eo^n)^/(IEo^^) > > 1 — 7 n- From (35), 

we therefore have that for any 0 < <5,^ > 1, Eo(^n < <JnIEo-^n) < (5n + 7 n- Further, 


IF’o(-^n < Sn^oZn) = 


> 


Gn< 


Gn< 


lo^ 

in 

lo^ 

in 


n ^ log Eg^n 


net 


(36) 


where the inequality follows since (logEo^n)/in > lEgGn = —na\/2 > —ne^l2. Choose 
5n so that {\og5n/in) = —ne^l2, i.e., 6n = = e"*". From (36) and the 

immediately preceding inequality, we therefore have 


Po(G'n < -net) <5n + ln = + 7n < + 7n. 


(37) 


The sequence 7 ^ is yet to be chosen; we shall do so now by optimizing the right hand 
side of (37). Consider the function g{x) = x + for B > 0. The function attains its 
minimum value on (0, 00 ) at the point x = log B/B and the minimum value of the function 
is (log B + l)/B. Therefore, choose jn = G log(nt)/ \/ ne^; note that for this choice 7 n —^ 0 
and 7 nnt —^ co; with this choice we have Eo(G'n < —ne^) ^ Clog(nt)/ \/ nt• 


Proof of Theorem 4.2 

The proof follows from an application of Corollary 3.5 to the present setting. We assume $\sigma^2 = 1$ for this proof. Also, at the very outset, we mention that we replace $\lambda_j$ by $a_n^{-1}e^{-j/a_n}$ subsequently, since after some algebra, it can be shown that $\lambda_j \asymp a_n^{-1}e^{-j/a_n}$. Recall

an = and choose /c„ = log in Corollary 3.5. We first verify 

that (CO) - (C3) are satisfied. 

We start with (CO), which requires verifying (Al) & (A2). For (Al), we have 
||A“^|L = Ar^ X < n by choice of kn- From Lemma 4.1, we have that for 

any j = l,...,/cn, sup^-g^ |(/>j(x)| < Setting we have 

Ll^kn log/Cn ^ ^(*ba+'i/2)/(2a+l) igg ^ < n as long as a > 1/{4(1 — 26)}, verifying (A2). 

We next verify (C1). Clearly, $k_n = o(n\epsilon_n^2)$. So it remains to establish that $\|\theta_0^*\|_H^2 = o(n\epsilon_n^2)$. Bound


at 2 
HI 


an < On 1101 

i=i 


oii^ max 

l<i<fcn 


The function x —)• is monotonically decreasing on the interval (0, 2aan) and mono- 

tonically increasing on [2Q;a„,oo). Therefore, maxi<j<fc^ ^ gj/anj- 2 a_ 

We have l/un < 1, and hence evaluated at j = 1 can be bounded above by 

e. evaluated at j = kn is bounded above by = o(l). Hence 

||^o||e/”^n < anlne\ 0 as n oo. 

To verify (C2), we need to show that ||/o - fo\\ 2 ^p ^ ^n- Indeed, ||/o - = 

ET=k„+lfXj < fcn^ll^oll^ = 0(62)’. 

It now remains to verify (C3). As noted in the paragraph after (17), the numerator 
in (18) can be expressed as n(||/ — ^ > M^e^) = ^ with Zjs 

i.i.d. N(0,1). Noting that X]^A:„+i ^ = 7 T,- 2 «/( 2 a+i) < ^2^ 


n 


i = fcn + l 


X,Z] > M\l] < ni A,(Z| - 1 ) > Mhl/2\. 


j = kn + l 


(38) 


$(Z_j^2 - 1)$s are mean-zero sub-exponential random variables. By an application of Bernstein's inequality for linear combinations of mean-zero sub-exponential random variables (Proposition 5.16 of [33]),


n<| X,{Z]-l)>Mhl/2\<2e^p 

j = kn-\-l 


— C' min 


M^e: 


4 A 


M^e. 


2 A 


Y.T=kn+i Xj) 


— C min < an 


< 2 exp 

= 2 exp [ — Cmin {a„M^ log^ n, UnM"^ log^ n}] = 2 exp(—log^ n). 


(39) 


where $K$, $C$, $C'$ are global constants. The second inequality in the previous display is due to $\sum_{j > k_n}\lambda_j^2 \asymp (1/2a_n)\,e^{-2k_n/a_n}$ and $\max_{j > k_n}\lambda_j = (1/a_n)\,e^{-(k_n+1)/a_n}$.

Next, the term in the denominator of (18), n(||/ — / 0 II 2 ,<en) = W(||0-0o||,, <en), 
where 0 = ( 0 i, 02 , • • •) with Qj ~ N(0, Aj) and 0o = ( 0 oi) 0 O 2 , • • •)• Set e„ = for 

some constant C. We show below that 


W( 0-00 1^^ < In) > exp{-C'n^/(2a+i) iQg2 


(40) 
We now establish (40). Recall $\lambda_j \asymp a_n^{-1}e^{-j/a_n}$. Let $\theta^* = (\theta_1, \ldots, \theta_{k_n})^T$ and recall $\theta_0^*$ defined

similarly. Then, W(||0 - - '^(11^* “ ^o|P < ^1/“^) “ ^Oj? < 

e^/2). Using Y2JLk„+i^0j — ll^o||^^n^” = o{e^), the second term can be bounded below 
by '^{Y.T=kn+i ^ ^n/4) • By Markov’s inequality, 


W( e]<el/4)>l-i/el Ee]^l 

j = kn + l j = kn + l 

> 1 - 


O'n^n 


E 




j = kn + l 
Afi kjijdji 

^^> 1/2 


(41) 


for large C. We used above that Y^^=k„+i e~^/°'"dx = and 

^-k„/a„ _ ^-2a/(2a+i)^ Therefore, it is enough to show the bound (40) for Wdl^* “^o||^ — 
e^/2). By Anderson’s inequality (Lemma B.2 in Appendix), 


wdl^* 


9lf<el/2)>e- 



W{\\etf<el/2). 


(42) 


We have already shown that ||^o|Ih ~ ® ^ilroliH > g c'ni/( 2 a+i)^ Therefore, 

suffices to bound 14 ^( 1161411 ^ < Cn/^)- Recall Oj/Xj ~ xf, therefore has a density 
(\/27ra)“^a„el42“n) exp(—anel/“"x/2)l(o,oo)(a^)- Let dx denote dxi...dxk^ in short and 
set Dn = ttnlV^- Then, 


w(^e^<il/2) =D 


k^{kn + l) 


n 


exp ( — a„e'^ 6 “"Xj 72 ) 




i=i 


dx 


> 


Dr, e„ \ " kn(k„+i} 


V2 




exp 


o^tlIq 


n 

i=i 2 




u„g„V" r(i/2) 


kn /*! 


r(A)„/2) 7,^0 


exp 


ar,e' 


knlo-n 


(43) 


From the hrst to the second line, we replace j by kn, perform a change of variable and 
drop the In term appearing inside the exponent as < 1. The last equality follows from 
the Dirichlet integral formula (Lemma B.3 in Appendix). Using r(l/2) = ^/^T and the 
standard inequality (see, for example, [ 1 ]) r(Q;) < 'j2'Ke^e~'^a^~^l‘^ for a > 1 , we can 
simplify (43) to write 


W 


E- 

Z =1 


< et/2 I > 


kn(k-n + l) 




2k„ 


fc „/2 


exp - 


/t=o 


a„e 


k-nj Cin 


^Ufe„/ 2 -i^^_ ( 44 ) 
The integral in the above display can be bounded from below by (2e ^ 
Substituting this bound and simplifying, the lower bound is 


1 fcn(fcn + l) 

—e 4<i„ g 
kn 



kn/2 


^ g—log^ n 


Combining with (42), (40) is proved. 

Finally, the ratio of the bounds in (39) and (40) converges to zero by choosing $M$ large enough, completing the proof.


Appendix 


A Proof of Lemma 4.1 


It suffices to show that for any t > 0, maxo<j<at sup^jgjg \ cl>j j^i{x)\ < for large a. For 

fixed 6, clearly c = + ‘iba? x a. Therefore, 4)o{x) = (c/ijV- < aV4, SO enough to take 

the max over 1 < j < at. Finally, since both (pj+i a-nd Hj are symmetric functions, it 
suffices to consider the supremum over x on (0,oo). 

The Hermite polynomials have an integral representation 

$$H_j(z) = \frac{2^j}{\sqrt{\pi}}\int_{-\infty}^{\infty}(z + it)^j e^{-t^2}\,dt. \qquad (45)$$
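The integral representation (45) can be checked numerically. The sketch below is one possible check, assuming the physicists' normalization of the Hermite polynomials (the convention used by `numpy.polynomial.hermite`): since the integrand is a polynomial in $t$ times $e^{-t^2}$, Gauss-Hermite quadrature evaluates the integral exactly up to rounding. The function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss, hermval


def hermite_via_integral(j, z, nodes=40):
    """Evaluate (2^j / sqrt(pi)) * int (z + i t)^j exp(-t^2) dt by Gauss-Hermite
    quadrature, which is exact for polynomial integrands of degree < 2 * nodes."""
    t, w = hermgauss(nodes)
    integrand = (z + 1j * t) ** j
    return (2.0**j / np.sqrt(np.pi)) * np.sum(w * integrand)


def hermite_direct(j, z):
    """Physicists' Hermite polynomial H_j(z) from numpy's Hermite series."""
    coef = np.zeros(j + 1)
    coef[j] = 1.0
    return hermval(z, coef)


if __name__ == "__main__":
    for j in range(6):
        value = hermite_via_integral(j, 1.3)
        print(j, value.real, hermite_direct(j, 1.3))
```

The imaginary part of the quadrature value vanishes (up to rounding) because odd powers of $t$ integrate to zero against $e^{-t^2}$.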


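For $z > 0$, using $|\int f\,d\mu| \le \int |f|\,d\mu$, we have

$$|H_j(z)| \le \frac{2^j}{\sqrt{\pi}}\int_{-\infty}^{\infty}(z^2 + t^2)^{j/2}e^{-t^2}\,dt = \frac{2^{j+1}z^{j+1}}{\sqrt{\pi}}\int_0^{\infty}(1 + t^2)^{j/2}e^{-z^2t^2}\,dt. \qquad (46)$$

Let $g(t) = (1 + t^2)^{j/2}e^{-z^2t^2/2}$; clearly, $\log g(t) = (j/2)\log(1 + t^2) - z^2t^2/2$. Differentiating, $\frac{d}{dt}\log g(t) = jt/(1 + t^2) - z^2t$. Setting $\frac{d}{dt}\log g(t) = 0$, we have $t\{(1 + t^2) - j/z^2\} = 0$. Therefore, if $z^2 > j$, $g(t)$ attains its maximum at $t = 0$ with $g(0) = 1$. On the other hand, if $z^2 < j$, $g(t)$ attains its maximum at $t = (j/z^2 - 1)^{1/2}$. When $z^2 > j$, bounding $g(t) \le 1$ in (46), we get the inequality

$$|H_j(z)| \le \frac{2^{j+1}z^{j+1}}{\sqrt{\pi}}\int_0^{\infty} g(t)\,e^{-z^2t^2/2}\,dt \le \frac{2^{j+1}z^{j+1}}{\sqrt{\pi}}\int_0^{\infty} e^{-z^2t^2/2}\,dt = \frac{2^{j+1}z^{j+1}}{\sqrt{\pi}}\cdot\frac{\sqrt{2\pi}}{2z} = \sqrt{2}\,2^j z^j. \qquad (47)$$

We record this bound below:

Lemma A.1. The Hermite polynomials satisfy $|H_j(z)| \le \sqrt{2}\,2^j z^j$ whenever $z^2 \ge j$.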
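A quick numerical sanity check of Lemma A.1 is sketched below, again assuming the physicists' Hermite polynomials as implemented in `numpy.polynomial.hermite`. The grid of test points and the function names are arbitrary choices made for illustration.

```python
import numpy as np
from numpy.polynomial.hermite import hermval


def hermite(j, z):
    """Physicists' Hermite polynomial H_j evaluated at z (scalar or array)."""
    coef = np.zeros(j + 1)
    coef[j] = 1.0
    return hermval(z, coef)


def polynomial_bound(j, z):
    """Right-hand side of the lemma: sqrt(2) * 2^j * z^j."""
    return np.sqrt(2.0) * (2.0 * z) ** j


def max_ratio(j, width=5.0, points=200):
    """Largest value of |H_j(z)| / bound over a grid of z with z^2 >= j."""
    z = np.linspace(np.sqrt(j), np.sqrt(j) + width, points)
    if j == 0:
        z = z[1:]  # avoid evaluating 0**0 ambiguity at z = 0
    return float(np.max(np.abs(hermite(j, z)) / polynomial_bound(j, z)))


if __name__ == "__main__":
    for j in (0, 1, 5, 10, 20):
        print(j, max_ratio(j))
```

Every printed ratio should be at most one; the ratio approaches $1/\sqrt{2}$ for large $z$, where the leading term $2^j z^j$ of $H_j$ dominates.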

Note that the exponential term in Cramér's bound has been replaced by a polynomial $z^j$ term. When $z^2 = j$, ignoring constants, Cramér's bound for $|H_j(z)|$ is $2^{j/2}\sqrt{j!}\,e^{j/2}$, while the same from Lemma A.1 is $2^j j^{j/2}$. Using $j! \asymp j^{j+1/2}e^{-j}$, we see that both bounds give similar results when $z^2 \asymp j$.

As discussed at the beginning, we now proceed to establish the bound for \f)j+i{x)\ 
for 1 < j < at and x > 0. For x G (0, \/t), use Cramer’s bound to obtain |())j_|_i(x)| < 
(c/6)^/^ when x G (0, \/i). 

When X > \/t, setting z = \f2cx, we have z^ > 2ct > j for any j < at. Therefore, 
we have two bounds for \Hj{z)\: (i) |iLj(z)| < \/2^jl from Cramer’s bound, and (ii) 
mz)\ < 2^z^ from Lemma A.l. Using a combination of both delivers a tighter bound for 
\4>j+i{x)\. Let (5 > 0 be such that c6 > b. Then, for any such 5, we may write 

\H,{z)\ = \H,{z)\^-^\H,{z)\^ 

Substituting this bound in the expression for fj+i, we have 

ky+iWI < (=/'>)“''* 

The function x —)• for x > 0 achieves its maximum at x = [j5/{2{c6 — 6)}]^'^^. 
Substituting $x^2 = j\delta/\{2(c\delta - b)\}$ in the above display and bounding $j! \ge (j/e)^j$,


^1/4 r j5 \ 


c5 

cS-b) 


.-jS/2 


= 23^ 1'^ 


Now choose b = 6e/{c(e — 2)}, so that cbjicb — b) = e/2. Then we have |(/>j+i(x)| 
since j6/2 < atb/2 < ctb/2 < bt. 


< 

rs-; 


B Some useful results 


We collect some matrix inequalities. Proofs can be found in standard texts; see, for example, [3].

Lemma B.1. For any two matrices $A$, $B$,

(i) $s_{\min}(A)\,\|B\|_F \le \|AB\|_F \le \|A\|_2\,\|B\|_F$.

(ii) If $s_{\min}(A) \ge \|B\|_2$, then $s_{\min}(A + B) \ge s_{\min}(A) - \|B\|_2$.

(iii) 


We use a version of Anderson's lemma from [30], which provides a sharp bound on the probability of shifted balls under multivariate Gaussian distributions in terms of the centered probability and the size of the shift.

Lemma B.2. Suppose $\zeta \sim N_n(0, \Sigma)$ with $\Sigma$ p.d. and $\zeta_0 \in \mathbb{R}^n$. Let $\|\zeta_0\|_H^2 = \zeta_0^T\Sigma^{-1}\zeta_0$. Then, for any $t > 0$,

$$P(\|\zeta - \zeta_0\|_2 \le t) \ge e^{-\|\zeta_0\|_H^2/2}\,P(\|\zeta\|_2 \le t).$$
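In one dimension both sides of the shifted-ball inequality are available in closed form through the normal distribution function, which gives a simple way to see how much is lost in the exponential factor. The sketch below is illustrative only, restricted to the scalar case $\zeta \sim N(0, s^2)$, with hypothetical function names.

```python
import math


def ball_prob(shift, sd, t):
    """P(|Z - shift| <= t) for Z ~ N(0, sd^2)."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))
    return cdf(shift + t) - cdf(shift - t)


def shifted_ball_lower_bound(shift, sd, t):
    """exp(-shift^2 / (2 sd^2)) * P(|Z| <= t): the scalar form of the lemma,
    where shift^2 / sd^2 plays the role of the squared RKHS norm of the shift."""
    return math.exp(-0.5 * (shift / sd) ** 2) * ball_prob(0.0, sd, t)


if __name__ == "__main__":
    for shift in (0.0, 0.5, 1.0, 2.0):
        print(shift, ball_prob(shift, 1.0, 0.3), shifted_ball_lower_bound(shift, 1.0, 0.3))
```

For zero shift the two quantities coincide; as the shift grows, the lower bound decays like $e^{-\|\zeta_0\|_H^2/2}$, which is the factor carried through (42).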


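We use the Dirichlet integral formula (formula 4.635 in [18]) to simplify integrals over the unit probability simplex.

Lemma B.3. Let $\varphi(\cdot)$ be a Lebesgue integrable function and $a_j > 0$, $j = 1, \ldots, n$. Then,

$$\int\cdots\int_{x_j \ge 0,\ \sum_{j} x_j \le 1} \varphi\Big(\sum_{j=1}^n x_j\Big)\prod_{j=1}^n x_j^{a_j - 1}\,dx_1\cdots dx_n = \frac{\prod_{j=1}^n\Gamma(a_j)}{\Gamma\big(\sum_{j=1}^n a_j\big)}\int_{t=0}^{1}\varphi(t)\,t^{\sum_{j=1}^n a_j - 1}\,dt.$$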